Data Validation Automation
Context
We worked with millions of rows of messy, user-input data. This data needed to be validated and cleaned before it entered our ETL software.
Challenge
The data validation and cleaning process was highly manual: changing values in Excel to match the expected input within our internal system. There was no universal validation standard, which left large room for error. Data validation took 3 business days to complete, which was an exceptionally long time based on team size and capacity.
Solution
We utilized Python to write functions to standardize and automate the data validation and cleaning process. We used the pandas and NumPy libraries to conduct the essential data analysis tasks. These included validating data for accurate totals, flagging anomalies associated with operating costs, and ensuring completeness in critical fields (e.g. ID numbers, department codes, dates, etc.) Scripts scanned selected files we received from clients, flagged discrepancies, and normalized the files into a format we could then review and load into our ETL system much sooner. Data validation time was cut by 50%.